Abstract

Optimal synthesis of reversible functions is a non-trivial problem. One of the major limiting factors in computing 
such circuits is the sheer number of reversible functions. Even restricting synthesis to 4-bit reversible functions 
results in a huge search space ($16! \approx 2^{44}$ functions). The output of such a search alone, counting only the space
required to list Toffoli gates for every function, would require over 100 terabytes of storage. 

In this paper, we present an algorithm that synthesizes an optimal circuit for any 4-bit reversible specification.
We employ several techniques to make the problem tractable. We report results from several experiments, including 
synthesis of random 4-bit permutations, optimal synthesis of all 4-bit linear reversible circuits, synthesis of existing 
benchmark functions, and distribution of optimal circuits. Our results have important implications for the design 
and optimization of quantum circuits, testing circuit synthesis heuristics, and performing experiments in the area 
of quantum information processing. 
1 Introduction

To the best of our knowledge, at present, physically reversible technologies are found only in the quantum domain [3].
However, "quantum" unites several technological approaches to information processing, including ion traps, optics,
superconducting, spin-based, and cavity-based technologies [pj. Of those, trapped ions [5] and liquid-state NMR
(Nuclear Magnetic Resonance) [TU] are two of the most developed quantum technologies targeted for computation
in the circuit model (as opposed to communication or adiabatic computing). These technologies allow computations
over sets of 8 qubits and 12 qubits, respectively.

Reversible circuits are an important class of computations that need to be performed efficiently for the purpose
of efficient quantum computation. Multiple quantum algorithms contain arithmetic units such as adders, multipliers,
exponentiation, comparators, quantum register shifts and permutations, that are best viewed as reversible circuits.
Moreover, reversible circuits are indispensable in quantum error correction [pj. Often, the efficiency of the reversible
implementation is the bottleneck of a quantum algorithm (e.g., integer factoring and discrete logarithm [16]) or even
a class of quantum circuits (e.g., stabilizer circuits [1]).

In this paper, we present an algorithm that finds optimal circuit implementations for 4-bit reversible functions. 
The algorithm has a number of potential uses and implications. 

One major implication of this work is that it will help physicists with experimental design, since foreknowledge
of the optimal circuit implementation aids in the control over quantum mechanical systems. The control of quantum
mechanical systems is very difficult, and as a result experimentalists are always looking for the best possible
implementation. Having an optimal implementation helps to improve experiments or show that more control over a
physical system needs to be established before a certain experiment could be performed.

A second important contribution is due to the efficiency of our implementation: about 0.01 seconds per synthesis of an
optimal 4-bit reversible circuit. The algorithm could easily be integrated as part of peephole optimization, such as
the one presented in [13].

Furthermore, our implementation allows us to propose a subset of optimal implementations that may be used to test 
heuristic synthesis algorithms. Currently, similar tests are performed by comparison to optimal 3-bit implementations. 
The best heuristic solutions have a very small overhead, making such a test hard to improve on. As such, it would help to
replace this test with a more difficult one that allows more room for improvement. 
Figure 1: NOT, CNOT, Toffoli, and Toffoli-4 gates.

Finally, due to the effectiveness of our approach, we are able to report new optimal implementations for small
benchmark functions, approximate $L(4)$, the number of reversible gates required to implement a reversible 4-bit
function, approximate the average number of gates required to implement a 4-bit permutation, and show the distribution
of the number of permutations that may be implemented with 0..9 gates.

2 Preliminaries 

In this paper, we consider circuits with NOT, CNOT, Toffoli (TOF), and Toffoli-4 (TOF4) gates defined as follows: 

• NOT(a): $a \mapsto a \oplus 1$;

• CNOT(a, b): $a, b \mapsto a, b \oplus a$;

• TOF(a, b, c): $a, b, c \mapsto a, b, c \oplus ab$;

• TOF4(a, b, c, d): $a, b, c, d \mapsto a, b, c, d \oplus abc$;

where $\oplus$ denotes an EXOR operation and concatenation is Boolean AND; see Figure 1 for illustration. These gates
are used widely in quantum circuit construction, and have been demonstrated experimentally in multiple quantum 
information processing proposals [S]. In particular, CNOT is a very popular gate among experimentalists, frequently 
used to demonstrate control over a multiple-qubit quantum mechanical system. Since quantum circuits describe time 
evolution of a quantum mechanical system where individual "wires" represent physical instances, and time propagates 
from left to right, this imposes restrictions on the circuit topology. In particular, quantum and reversible circuits 
are strings of gates. As a result, feed-back (time wrap) is not allowed and there may be no fan-out (mass/energy 
conservation). 
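The four gate definitions above can be sketched in C as a single controlled-NOT primitive acting on a 4-bit state. This is only an illustration: the bit order (wire a in bit 0 through wire d in bit 3) and all function names are assumptions of the sketch, not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Wires a, b, c, d are stored in bits 0..3 of the state (assumed encoding). */
enum { WIRE_A = 0, WIRE_B = 1, WIRE_C = 2, WIRE_D = 3 };

/* Flip `target` iff every wire in the bit mask `controls` is 1.
   No controls gives NOT, one gives CNOT, two give TOF, three give TOF4. */
uint8_t apply_gate(uint8_t state, uint8_t controls, int target)
{
    assert(state < 16 && target >= 0 && target < 4);
    assert((controls & (1u << target)) == 0);  /* the target is never a control */
    if ((state & controls) == controls)
        state ^= (uint8_t)(1u << target);
    return state;
}

uint8_t not_gate(uint8_t s, int a)
{
    return apply_gate(s, 0, a);
}

uint8_t cnot_gate(uint8_t s, int a, int b)
{
    return apply_gate(s, (uint8_t)(1u << a), b);
}

uint8_t tof_gate(uint8_t s, int a, int b, int c)
{
    return apply_gate(s, (uint8_t)((1u << a) | (1u << b)), c);
}

uint8_t tof4_gate(uint8_t s, int a, int b, int c, int d)
{
    return apply_gate(s, (uint8_t)((1u << a) | (1u << b) | (1u << c)), d);
}
```

Because each gate only XORs the target with a product of other wires, applying the same gate twice restores the original state, which is the self-inverse property used later in the symmetry argument.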

In this paper, we are concerned with searching for circuits requiring a minimal number of gates. Our focus is 
the proof of principle, i.e., showing that any optimal 4-bit reversible function may be synthesized efficiently, rather 
than attempting to report optimal implementations for a number of potentially plausible cost metrics. In fact, our 
implementation allows other circuit cost metrics to be considered, as discussed in Section 5.

In related work, there have been a few attempts to synthesize optimal reversible circuits with more than three
inputs. Große et al. [5] employ a SAT-based technique to synthesize provably optimal circuits for some small parameters.
However, their implementation quickly runs out of resources. The longest optimal circuit they report contains 11
gates. It took 21,897.3 seconds to synthesize, whereas the implementation we report in this paper synthesized the
same function in .000106 seconds; see Table 6. Prasad et al. [13] used breadth-first search to synthesize 26,000,000
optimal 4-bit reversible circuits with up to 6 gates in 152 seconds. We extend this search to find 117,798,040,190
optimal circuits with up to 9 gates in 10,549 seconds. This is over 65 times faster and 4,500 times more circuits than reported
in [13]. Yang et al. [17] considered short optimal reversible 4-bit circuits composed of NOT, CNOT, and Peres [11]
gates. They were able to synthesize optimal circuits for even permutations requiring no more than 12 gates. This
amounts to approximately one quarter of the number of all 4-bit reversible functions. Our implementation allows
optimal synthesis of any 4-bit reversible function, and it is much faster.

2.1 Motivating Example 

Consider the two reversible circuit implementations in Figure 2 of a 1-bit full adder. This elementary function/circuit
serves as a building block for constructing integer adders. The famous Shor's integer factoring algorithm is dominated 
by adders like this. As such, the complexity of an elementary 1-bit adder circuit largely affects the efficiency of 
factoring an integer number with a quantum algorithm. It is thus important to have a well-optimized implementation 
of a 1-bit adder, as well as other similar small quantum circuit building blocks. 
Figure 2: (a) a suboptimal and (b) an optimal circuit for 1-bit full adder.

In this paper, we consider the synthesis of optimal circuits, i.e., we provably find the best possible implementation.
Using optimal implementations of circuits potentially increases the efficiency of quantum algorithms and helps to 
reduce the difficulty with controlling quantum experiments. 



3 Algorithm and Implementation 

We first outline our algorithm and then discuss it in detail in the following subsections.

There are $N = (2^n)!$ reversible $n$-variable functions. The most obvious approach to the synthesis of all optimal
implementations is to compute all optimal circuits and store them for later look-up. However, this is extremely
inefficient, because such an approach requires $\Omega(N)$ space and, as a result, at least $\Omega(N)$ time. These space
and time estimates are lower bounds, because, for instance, storing an optimal circuit requires more than a constant
number of bits, but for simplicity, let us assume these figures are exact. Although we consider both figures for space
and time impractical, we use this simple idea as our starting point.

We first improve the space requirement by observing that if one synthesizes all halves of all optimal circuits, then
it is possible to search through this set to find both halves of any optimal circuit. It can be shown that the space
requirement for storing halves has a lower bound of $\Omega(\sqrt{N})$. However, searching for two halves potentially requires
a runtime on the order of the square of the search space, $O((\sqrt{N})^2) = O(N)$, a figure for runtime that we deemed
inefficient. Our second improvement is thus to use a hash table to store the optimal halves. This reduces the runtime
to soft $\Omega(\sqrt{N})$. While this lower bound does not necessarily imply that the actual complexity is lower than $O(N)$,
this turns out to be the case, because the set of optimal halves is indeed much smaller than the set of all optimal
circuits (an analytic estimate for the relative size of the former set is hard to obtain, though). Cumulatively, these
two improvements reduce the $\Omega(N)$ space and $\Omega(N)$ time requirements to $O(\#\mathrm{halves}(N))$ space and soft
$O(\#\mathrm{halves}(N))$ time. These reductions almost suffice to make the search possible using modern computers.

Our last step, apart from careful coding, that made the search possible is the reduction of the space requirement
(with a consequent improvement in runtime) by a constant factor of almost 48 via exploiting the following two features.
First, simultaneous input/output relabeling, of which there are at most 24 ($= 4!$) different ones, does not change the
optimality of a circuit. Second, if an optimal circuit is found for a function $f$, an optimal circuit for the inverse
function $f^{-1}$ can be obtained by reversing the optimal circuit for $f$. This allows us to additionally "pack" up to twice
as many functions into one circuit. The cumulative improvement resulting from these two observations is by a factor
of almost $2 \times 24 = 48$. Due to symmetries, the actual number is slightly less. See Table 4 (column 2 versus column
3) for an exact comparison.

3.1 The search-and-lookup algorithm 

For brevity, let the size of a reversible function mean the minimal number of gates required to implement it. Using 
breadth-first search, we can generate the smallest circuits for all reversible functions of size at most k, for a certain 
value of k. (This can be done in advance, on a larger machine, and need not be repeated for each reversible function.) 

Assume that the given function $f$, for which we need to synthesize a minimal circuit, has size at most $2k$. We can
first check whether $f$ is among the known functions of size at most $k$ and, if so, output the corresponding minimal
circuit. If not, then the size of $f$ is between $k + 1$ and $2k$, inclusive, and there exist reversible functions $g$ and $h$ of
size at most $k$ and exactly $k$, respectively, such that $f = h \circ g$. If we find such $g$ of the smallest size, then we can
obtain the smallest circuit for $f$ by composing the circuits for $g$ and $h$.

Multiplying the above equality by $g^{-1}$ on the right, we obtain $f \circ g^{-1} = h$. Observe that $g^{-1}$ has the same size as $g$.
Therefore, by trying all functions $g$ of size $1, 2, \ldots, k$ until we find one such that $f \circ g$ has size $k$, we can find a $g$
of the smallest size.
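The search-and-lookup idea can be illustrated on a toy scale. The sketch below is not the paper's implementation: it works on 3 wires instead of 4 so that a direct-address array can stand in for the hash table, it stores every function rather than canonical representatives, and it returns only the size instead of the circuit. The packing (3 bits per value), the choice K = 4, and all names are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A function f on {0..7} is packed 3 bits per value: bits 3x..3x+2 hold f(x). */
enum { K = 4, NUM_GATES = 12, NUM_PERMS = 40320 };

static uint32_t gates[NUM_GATES];
static uint8_t  size_plus_one[1u << 24];  /* 0: not reached by the BFS        */
static uint32_t by_size[NUM_PERMS];       /* functions in BFS discovery order */
static int      first_of_size[K + 2];     /* index of first function of size i */

/* Circuit-order composition: apply f first, then g. */
uint32_t then3(uint32_t f, uint32_t g)
{
    uint32_t r = 0;
    for (int x = 0; x < 8; x++) {
        uint32_t fx = (f >> (3 * x)) & 7u;
        r |= ((g >> (3 * fx)) & 7u) << (3 * x);
    }
    return r;
}

/* NOT, CNOT, TOF on 3 wires: each target with any subset of the other wires. */
void make_gates3(void)
{
    int n = 0;
    for (int t = 0; t < 3; t++)
        for (uint32_t c = 0; c < 8; c++) {
            if (c & (1u << t)) continue;      /* controls exclude the target */
            uint32_t g = 0;
            for (uint32_t x = 0; x < 8; x++) {
                uint32_t y = ((x & c) == c) ? (x ^ (1u << t)) : x;
                g |= y << (3 * x);
            }
            gates[n++] = g;
        }
    assert(n == NUM_GATES);
}

/* Breadth-first search over all functions of size at most K. */
void build_table3(void)
{
    if (first_of_size[K + 1] != 0) return;    /* already built */
    make_gates3();
    uint32_t id = 0;
    for (uint32_t x = 0; x < 8; x++) id |= x << (3 * x);
    int count = 0;
    by_size[count++] = id;
    size_plus_one[id] = 1;
    first_of_size[0] = 0;
    for (int i = 1; i <= K; i++) {
        first_of_size[i] = count;
        for (int j = first_of_size[i - 1]; j < first_of_size[i]; j++)
            for (int g = 0; g < NUM_GATES; g++) {
                uint32_t h = then3(by_size[j], gates[g]);
                if (size_plus_one[h] == 0) {
                    size_plus_one[h] = (uint8_t)(i + 1);
                    by_size[count++] = h;
                }
            }
    }
    first_of_size[K + 1] = count;
}

/* Size of f if it is at most 2K, otherwise -1. f must be a packed permutation. */
int minimal_size3(uint32_t f)
{
    if (size_plus_one[f]) return size_plus_one[f] - 1;
    for (int i = 1; i <= K; i++)                 /* smallest g first */
        for (int j = first_of_size[i]; j < first_of_size[i + 1]; j++)
            if (size_plus_one[then3(f, by_size[j])])
                return K + i;   /* the remainder then has size exactly K */
    return -1;
}
```

A caller would run `build_table3()` once and then query `minimal_size3()` for any packed 3-bit permutation. Trying the candidates `g` in order of increasing size is what guarantees that the first hit yields the minimal size.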
The above algorithm involves sequential access to the functions of size at most k and their minimal circuits and a
membership test among functions of size k. Since the latter test must be fast and requires random memory access, we 
need to store all functions of size k in memory. Thus, the amount of available RAM imposes an upper bound on k. 

In practice, we store a 4-bit reversible function using a 64-bit word, because this allows for an efficient
implementation of functional composition, inversion, and other necessary operations. On a typical PC with 4GB of RAM, we
can store all functions for k = 6. This means that we can apply the above search algorithm only to functions of size 
at most 12. Unfortunately, this will not cover all 4-bit reversible functions. Therefore, further reduction of memory 
usage is necessary. 

3.2 Symmetries 

A significant reduction of the search space can be achieved by taking into account the following symmetries of circuits: 

1. Simultaneous relabeling of inputs and outputs. Given an optimal circuit implementing a 4-bit reversible function
$f$ with inputs $x_0, x_1, x_2, x_3$ and outputs $y_0, y_1, y_2, y_3$ and a permutation $\sigma : \{0, 1, 2, 3\} \to \{0, 1, 2, 3\}$, we can
construct a new circuit by relabeling the inputs and outputs into $x_{\sigma(0)}, x_{\sigma(1)}, x_{\sigma(2)}, x_{\sigma(3)}$ and
$y_{\sigma(0)}, y_{\sigma(1)}, y_{\sigma(2)}, y_{\sigma(3)}$, respectively. Then the new circuit will provide a minimal implementation of the
corresponding reversible function $f_\sigma$. Indeed, if it is not minimal and there is an implementation of $f_\sigma$ by a circuit
with a smaller number of gates, we can relabel the inputs and outputs of this implementation with $\sigma^{-1}$ and obtain a
smaller circuit implementing the original function $f$. This contradicts the assumption that the original circuit for $f$
is optimal.

Given $f$ and $\sigma$, a formula for $f_\sigma$ can be easily obtained. Observe that the mapping
$x_0, x_1, x_2, x_3 \mapsto x_{\sigma(0)}, x_{\sigma(1)}, x_{\sigma(2)}, x_{\sigma(3)}$ is a 4-bit reversible function, which we denote by $g_\sigma$. The mapping
$y_{\sigma(0)}, y_{\sigma(1)}, y_{\sigma(2)}, y_{\sigma(3)} \mapsto y_0, y_1, y_2, y_3$ is then given by the inverse, $g_\sigma^{-1}$. Therefore, the four bit values
$y_0, y_1, y_2, y_3$ of $f_\sigma$ on a four-bit tuple $x_0, x_1, x_2, x_3$ can be obtained by applying first $g_\sigma$, then $f$, and finally
$g_\sigma^{-1}$. We obtain $f_\sigma = g_\sigma^{-1} \circ f \circ g_\sigma$. We call the set of functions $f_\sigma$ the conjugacy class of $f$ modulo
simultaneous input/output relabelings.

Since there exist 24 permutations of 4 numbers, by choosing different permutations $\sigma$, we obtain 24 functions of
the above form $f_\sigma$ for a fixed function $f$. Some of these functions may be equal, whence the size of the conjugacy
class of $f$ may be smaller than 24. For example, if $f = \mathrm{NOT}(a)$, then there exist only 4 distinct functions of
the form $f_\sigma$ (counting $f$ itself). Our experiments show, however, that for the vast majority of functions, the
conjugacy classes are of size 24.

2. Inversion. As mentioned above, if we know a minimal implementation for $f$, then we know one for its inverse as
well.

Note that conjugation and inversion commute:

$$(g_\sigma^{-1} \circ f \circ g_\sigma)^{-1} = g_\sigma^{-1} \circ f^{-1} \circ g_\sigma.$$

For a function $f$, consider the union of the two conjugacy classes of $f$ and $f^{-1}$. Call the elements of this union
equivalent to $f$. It follows that equivalent functions have the same size. Moreover, since gates are involutions (i.e.,
equal to their own inverses) and their conjugacy classes consist of gates, if we know a minimal circuit for $f$, we can
easily obtain one for any function in the equivalence class of $f$. Formally, if $f = g_1 \circ \ldots \circ g_n$, where $n$ is the size of $f$
and $g_i$ are gates, then $f^{-1} = g_n \circ \ldots \circ g_1$, and if $f' = g_\sigma^{-1} \circ f \circ g_\sigma$, then $f' = g'_1 \circ \ldots \circ g'_n$, where
$g'_i = g_\sigma^{-1} \circ g_i \circ g_\sigma$ are also gates. Our experiments show that a vast majority of functions have 48 distinct
equivalent functions. This fact can reduce the search space by almost a factor of 48 as follows.

For a function $f$, we define a canonical representative of its equivalence class. A convenient canonical representative
can be obtained by introducing the lexicographic order on the set of 4-bit reversible functions, considered as
permutations of $\{0, 1, 2, \ldots, 15\}$ and encoded accordingly by the sequence $f(0), f(1), \ldots, f(15)$, and choosing the function
whose corresponding sequence is lexicographically smallest. Now, instead of storing all functions of size at most $k$,
we store the canonical representative for each equivalence class. This will reduce the storage size by almost a factor of
48. Then, we use Algorithm 1 to search for a minimal circuit for a given reversible function $f$.

The algorithm requires a hash table with canonical representatives of equivalence classes of size at most $k$, together
with the last gates of their minimal circuits, and lists of all permutations of size at most $L - k$. We have pre-computed
the canonical representatives for $k = 9$ using breadth-first search (see Algorithm 2). For efficiency reasons, we store
the last or the first gate of a minimal circuit for each canonical representative. However, this information is clearly
sufficient to reconstruct the entire circuit and, in particular, the last gate. Using this pre-computed data, the hash
Algorithm 1 Minimal circuit.

Require: Reversible function $f$ of size at most $L$.
    Hash table $H$ containing canonical representatives of all equivalence classes of functions of size at most $k$
    and the last gates of their minimal circuits, $k \ge L/2$.
    Lists $A_i$, $1 \le i \le L - k$, of all functions of size $i$.
Ensure: A minimal circuit $c$ for $f$.

if $f$ = IDENTITY then
    return empty circuit
end if
$E_f \leftarrow$ equivalence class of $f$
$\bar{f} \leftarrow$ canonical representative of $E_f$
if $\bar{f} \in H$ then
    $\lambda \leftarrow$ last gate of $\bar{f}$
    if $f$ is a conjugate of $\bar{f}$ then
        let $f = g_\sigma^{-1} \circ \bar{f} \circ g_\sigma$
        $\lambda \leftarrow g_\sigma^{-1} \circ \lambda \circ g_\sigma$
        $c \leftarrow$ minimal circuit for $f \circ \lambda$
        return $c \circ \lambda$
    else
        let $f = g_\sigma^{-1} \circ \bar{f}^{-1} \circ g_\sigma$
        $\lambda \leftarrow g_\sigma^{-1} \circ \lambda \circ g_\sigma$
        $c \leftarrow$ minimal circuit for $\lambda \circ f$
        return $\lambda \circ c$
    end if
end if
for $i = 1$ to $L - k$ do
    for $g \in A_i$ do
        $h \leftarrow g \circ f$
        $E_h \leftarrow$ equivalence class of $h$
        $\bar{h} \leftarrow$ canonical representative of $E_h$
        if $\bar{h} \in H$ then
            $c_g \leftarrow$ minimal circuit for $g$
            $c_h \leftarrow$ minimal circuit for $h$
            return $c_g^{-1} \circ c_h$
        end if
    end for
end for
return error: size of $f$ is greater than $L$




table and the lists of all permutations of size at most $L - k$ are formed at start-up. An implementation storing
only the hash table is possible. Such an implementation will require less RAM, but it will be slower. We
decided to focus on higher speed, because Table 3 indicates that we do not need to be able to search for optimal circuits
requiring up to 18 ($= 9 \times 2$) gates, which we could do otherwise by storing only the hash table.

The correctness of Algorithm 1 is proved as follows. Suppose first that the size of $f$ is at most $k$. The canonical
representative $\bar{f}$ of its equivalence class will have the same size as $f$, so it will be found in the hash table $H$. Since $\lambda$
is the last gate of a minimal circuit for $\bar{f}$, the size of $\bar{f} \circ \lambda$ is one less than the size of $\bar{f}$. The function $f \circ \lambda$ (computed
if $f$ is a conjugate of $\bar{f}$) or the function $\lambda \circ f$ (computed if $f$ is a conjugate of $\bar{f}^{-1}$) is equivalent to $\bar{f} \circ \lambda$ and therefore
also is of size one less than the size of $f$. Therefore, the recursive call on that function will terminate and return a
minimal circuit, which we can compose with $\lambda$ (at the proper side) to obtain a minimal circuit for $f$. The depth of
recursion is equal to the size of $f$, and at each call we do one hash table lookup, one computation of the canonical
representative, and one conjugation of a gate (the latter can be looked up in a small table). Thus, this part of the
algorithm requires negligible time.

If the size of $f$ is greater than $k$, but does not exceed $L$, then $f = g' \circ h$ for some $h$ of size $k$ and $g'$ of size $i$,
$1 \le i \le L - k$. Then $g = g'^{-1} \in A_i$. Once the inner for-loop encounters this $g$, it will return the minimal circuit
Algorithm 2 Breadth-first search.

Require: $k$
Ensure: Lists $A_i$ of canonical representatives of size $\le k$;
    Hash table $H$ with these canonical representatives and their first or last gates.

Let $H$ be a hash table (keys are functions, values are gates)
$H$.insert(IDENTITY, HAS_NO_GATES)
$A_0 \leftarrow$ {IDENTITY}
for $i$ from 1 to $k$ do
    for $f \in A_{i-1} \cup \{a^{-1} \mid a \in A_{i-1}\}$ do
        for all gates $\lambda$ do
            $h \leftarrow f \circ \lambda$
            $E_h \leftarrow$ equivalence class of $h$
            $\bar{h} \leftarrow$ canonical representative of $E_h$
            if $\bar{h} \notin H$ then
                if $\bar{h}$ is a conjugate of $h$ then
                    let $\bar{h} = g_\sigma^{-1} \circ h \circ g_\sigma$
                    $H$.insert($\bar{h}$, $g_\sigma^{-1} \circ \lambda \circ g_\sigma$, IS_A_LAST_GATE)
                else
                    let $\bar{h} = g_\sigma^{-1} \circ h^{-1} \circ g_\sigma$
                    $H$.insert($\bar{h}$, $g_\sigma^{-1} \circ \lambda \circ g_\sigma$, IS_A_FIRST_GATE)
                end if
                $A_i$.insert($\bar{h}$)
            end if
        end for
    end for
end for




for $f$, because both recursive calls are for functions of size at most $k$. For a function $f$ of size $s > k$, the number of
iterations $N_{\mathrm{iter}}$ required to find the minimal circuit satisfies

$$\sum_{i=1}^{s-1-k} |A_i| \;<\; N_{\mathrm{iter}} \;\le\; \sum_{i=1}^{s-k} |A_i|.$$

At each iteration, one canonical representative is computed and looked up in the hash table. Since the size of $A_i$
grows almost exponentially (see Table 4, left column), the search time will decrease almost exponentially, and the
storage will increase exponentially, as $k$ increases. The timings for $k = 8, 9$ measured on two different systems are
summarized in Table 1 (see Section 4 for machine details). The hash table loading times and overall memory usage
were 119 seconds and 3.5 GB ($k = 8$), and 1111 seconds and 43.04 GB ($k = 9$).

It follows from the above complexity analysis that the performance of the following key operations affects the speed
most:

• composition of two functions ($f \circ g$) and inverse of a function ($f^{-1}$),

• computation of the canonical representative of an equivalence class,

• hash table lookup.

In the next subsection, we discuss an efficient implementation of these operations.

3.3 Implementation details 

As mentioned above, a 4-bit reversible function can be stored in a 64-bit word, by allocating 4 bits for each value
of $f(0), f(1), \ldots, f(15)$. Then the composition of two functions can be computed in 94 machine instructions using
the algorithm composition and the inverse function can be computed in 59 machine instructions using algorithm
inverse.
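As a readable counterpart to the packed representation just described, the following sketch implements composition and inversion with plain loops. It is not the paper's unrolled code: the nibble layout (nibble x holds f(x), counted from the least significant end), the circuit-order convention for composition, and the `perm4_*` names are all assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t perm4;  /* nibble x (bits 4x..4x+3) holds f(x) */

#define PERM4_IDENTITY UINT64_C(0xFEDCBA9876543210)

/* Read f(x) from the packed word. */
unsigned perm4_at(perm4 f, unsigned x)
{
    assert(x < 16);
    return (unsigned)((f >> (4 * x)) & 15u);
}

/* r(x) = g(f(x)): apply f first, then g (assumed convention). */
perm4 perm4_then(perm4 f, perm4 g)
{
    perm4 r = 0;
    for (unsigned x = 0; x < 16; x++)
        r |= (perm4)perm4_at(g, perm4_at(f, x)) << (4 * x);
    return r;
}

/* q(f(x)) = x for every x. */
perm4 perm4_inverse(perm4 f)
{
    perm4 q = 0;
    for (unsigned x = 0; x < 16; x++)
        q |= (perm4)x << (4 * perm4_at(f, x));
    return q;
}
```

A compiler will typically unroll these fixed 16-iteration loops, but the hand-unrolled shift-and-mask sequences referred to in the text avoid the per-element extraction overhead, which matters when the operations sit in the innermost loop of the search.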

In order to find the canonical representative in the equivalence class of a function $f$, we compute $f^{-1}$, generate
all conjugates of $f$ and $f^{-1}$, and choose the smallest among the resulting 48 functions. Since every permutation of
Table 1: Average times of computing minimal circuits of sizes 0..14 (in seconds)
Table 2: Parameters of linear hash tables storing canonical representatives.

k                       7         8         9
Size                    2^25      2^28      2^32
Memory Usage            256 MB    2 GB      32 GB
Load Factor             0.58      0.84      0.51
Average Chain Length    3.14      9.18      2.63
Maximal Chain Length    92        754       86


$\{0, 1, 2, 3\}$ can be represented as a product of transpositions (0, 1), (1, 2), and (2, 3), the sequence of conjugates of $f$
by all 24 permutations can be obtained through conjugating $f$ by these transpositions. These conjugations can be
performed in 14 machine instructions each, as in function conjugate01.

Two functions can be compared lexicographically using a single unsigned comparison of the corresponding two
words. Thus, the canonical representative can be computed using one inversion, $23 \times 2 = 46$ conjugations by
transpositions, and 47 comparisons, which totals 750 machine instructions.
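The canonical-representative computation can be sketched without the bit tricks by conjugating with each of the 24 wire relabelings directly. This is a slow reference version under assumed conventions (nibble x of the word holds f(x), and the representative is the numerically smallest packed word); the function names are hypothetical, and it does not reproduce the transposition-based sequence described above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t perm4;  /* nibble x (bits 4x..4x+3) holds f(x); assumed layout */

/* Move bit i of a 4-bit word to position sigma[i]. */
unsigned relabel_bits(unsigned x, const int sigma[4])
{
    unsigned y = 0;
    for (int i = 0; i < 4; i++)
        if (x & (1u << i)) y |= 1u << sigma[i];
    return y;
}

/* Conjugate f by sigma: the result maps relabel(x) to relabel(f(x)). */
perm4 conjugate(perm4 f, const int sigma[4])
{
    perm4 r = 0;
    for (unsigned x = 0; x < 16; x++) {
        unsigned fx = (unsigned)((f >> (4 * x)) & 15u);
        r |= (perm4)relabel_bits(fx, sigma) << (4 * relabel_bits(x, sigma));
    }
    return r;
}

perm4 inverse(perm4 f)
{
    perm4 q = 0;
    for (unsigned x = 0; x < 16; x++)
        q |= (perm4)x << (4 * ((f >> (4 * x)) & 15u));
    return q;
}

/* out[0..23]: conjugates of f; out[24..47]: conjugates of its inverse. */
void equivalents(perm4 f, perm4 out[48])
{
    perm4 inv = inverse(f);
    int n = 0, s[4];
    for (s[0] = 0; s[0] < 4; s[0]++)
        for (s[1] = 0; s[1] < 4; s[1]++)
            for (s[2] = 0; s[2] < 4; s[2]++) {
                if (s[0] == s[1] || s[0] == s[2] || s[1] == s[2]) continue;
                s[3] = 6 - s[0] - s[1] - s[2];   /* the remaining wire */
                out[n] = conjugate(f, s);
                out[n + 24] = conjugate(inv, s);
                n++;
            }
    assert(n == 24);
}

/* Smallest packed word among the (up to) 48 equivalent functions. */
perm4 canonical(perm4 f)
{
    perm4 all[48], best;
    equivalents(f, all);
    best = all[0];
    for (int i = 1; i < 48; i++)
        if (all[i] < best) best = all[i];
    return best;
}
```

Any fixed total order on packed words yields a valid representative, so the single unsigned comparison is enough; what matters is that every member of an equivalence class maps to the same word.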

For the fast membership test, we use a linear probing hash table with Thomas Wang's hash function [TB] (see
algorithm hash64shift).

This function is well suited for our purposes: it is fast to compute and distributes the permutations uniformly over
the hash table. The parameters of the hash tables storing the canonical representatives of equivalence classes of size
$k$, for $k = 7, 8, 9$, are shown in Table 2.
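A linear-probing membership table of the kind described here can be sketched as follows. The hash function is the `hash64shift` mix transcribed to unsigned C arithmetic; the `perm_set` type, its functions, and the use of 0 as the empty-slot marker (safe because a packed permutation is never 0) are assumptions of this sketch rather than details from the paper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Thomas Wang's 64-bit mix, written with unsigned arithmetic. */
uint64_t hash64shift(uint64_t key)
{
    key = (~key) + (key << 21);
    key ^= key >> 24;
    key = (key + (key << 3)) + (key << 8);
    key ^= key >> 14;
    key = (key + (key << 2)) + (key << 4);
    key ^= key >> 28;
    key += key << 31;
    return key;
}

typedef struct {
    uint64_t *slots;  /* 0 marks an empty slot */
    size_t    mask;   /* capacity - 1; capacity is a power of two */
} perm_set;

/* Returns 1 on success, 0 if the allocation failed. */
int perm_set_init(perm_set *s, unsigned log2_capacity)
{
    size_t cap = (size_t)1 << log2_capacity;
    s->slots = calloc(cap, sizeof *s->slots);
    s->mask = cap - 1;
    return s->slots != NULL;
}

/* Insert a nonzero key; the caller must keep the table from filling up. */
void perm_set_insert(perm_set *s, uint64_t key)
{
    assert(key != 0);
    size_t i = (size_t)hash64shift(key) & s->mask;
    while (s->slots[i] != 0 && s->slots[i] != key)
        i = (i + 1) & s->mask;           /* probe the next slot */
    s->slots[i] = key;
}

int perm_set_contains(const perm_set *s, uint64_t key)
{
    size_t i = (size_t)hash64shift(key) & s->mask;
    while (s->slots[i] != 0) {
        if (s->slots[i] == key) return 1;
        i = (i + 1) & s->mask;
    }
    return 0;
}
```

Since the keys themselves are the 64-bit packed functions, each slot is one machine word, which matches the memory figures in Table 2 (for example, $2^{25}$ slots of 8 bytes give 256 MB).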

4 Performance and Results 

We performed several tests using two computer systems, CS1 and CS2. CS1 is a high performance server with 16 
AMD Opteron 2300 MHz processors, 64 GB RAM, and Seagate Barracuda ES2 SCSI 7200 RPM HDD running Linux. 
CS2 is a Sony VGN-NS190D laptop with an Intel Core Duo 2000 MHz processor, 4 GB RAM, and a 5400 RPM SATA
HDD running Linux. The following subsections summarize the tests and results. 

4.1 Synthesis of Random Permutations 

In this test, we generated 10,000,000 random uniformly distributed permutations using the Mersenne twister random 
number generator [7]. The test was executed on CS1. It took 104,616.716 seconds (about 29 hours) of user time
and the maximal RAM usage was 43.04 GB. Note that 1111 seconds (approximately 18 minutes) were spent
loading previously computed optimal circuits with up to 9 gates (see Subsection 4.2 for details) into RAM. On average,
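unsigned64 composition (unsigned64 p, unsigned64 q) {
    ...
    return r;
}

/* Nibble x of p holds p(x); the result q satisfies q(p(x)) = x. */
unsigned64 inverse (unsigned64 p) {
    p >>= 2;                                   /* (p & 60) is now 4 * p(1) */
    unsigned64 q = (unsigned64)1 << (p & 60);  /* cast keeps the shift 64-bit */
    p >>= 4; q |= (unsigned64)2 << (p & 60);
    p >>= 4; q |= (unsigned64)3 << (p & 60);
    p >>= 4; q |= (unsigned64)4 << (p & 60);
    p >>= 4; q |= (unsigned64)5 << (p & 60);
    p >>= 4; q |= (unsigned64)6 << (p & 60);
    p >>= 4; q |= (unsigned64)7 << (p & 60);
    p >>= 4; q |= (unsigned64)8 << (p & 60);
    p >>= 4; q |= (unsigned64)9 << (p & 60);
    p >>= 4; q |= (unsigned64)10 << (p & 60);
    p >>= 4; q |= (unsigned64)11 << (p & 60);
    p >>= 4; q |= (unsigned64)12 << (p & 60);
    p >>= 4; q |= (unsigned64)13 << (p & 60);
    p >>= 4; q |= (unsigned64)14 << (p & 60);
    p >>= 4; q |= (unsigned64)15 << (p & 60);
    return q;
}

unsigned64 conjugate01 (unsigned64 p) {
    /* swap bits 0 and 1 of every input index (exchange nibbles 1 and 2 in each group of four) */
    p = (p & 0xF00FF00FF00FF00F) |
        ((p & 0x00F000F000F000F0) << 4) |
        ((p & 0x0F000F000F000F00) >> 4);
    /* swap bits 0 and 1 of every output value */
    return (p & 0xCCCCCCCCCCCCCCCC) |
        ((p & 0x1111111111111111) << 1) |
        ((p & 0x2222222222222222) >> 1);
}
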



it took only 0.01035 seconds to synthesize an optimal circuit for a permutation. The distribution of the circuit sizes
is shown in Table 3.

Note that the ratio of the number of random permutations requiring 9 gates to the number of all random
permutations, $\frac{50{,}861}{10{,}000{,}000} \approx 0.005086$, is close to the ratio of the number of all permutations requiring 9 gates to the
number of all permutations, $\frac{105{,}984{,}823{,}653}{16!} \approx 0.005066$. This suggests that the weighted average over the random sample,
equal to 11.94 gates per circuit, is close to the actual weighted average. We further use this random sample and the
results of the optimal 3-bit circuit synthesis [15] to approximate the number of permutations requiring 10 through 17
gates; see Table 4.

We conjecture that there are no permutations requiring 17 gates, and few, if any, that require 16 gates.
This implies that our search may be performed on a machine capable of storing reduced optimal implementations
with up to 8 gates, i.e., a machine with 4 GB RAM. Further analysis suggests that the search for an optimal circuit
will complete in the majority of cases (99.999% assuming uniform distribution) if one uses optimal circuits with at
most 7 gates and stores only the hash table. Such a search requires slightly more than 256 MB of available RAM, and
could be executed on an older machine.

4.2 Distribution of Optimal Implementations 

Table 4 lists the distribution of the number of permutations that can be realized with optimal circuits requiring no
more than 9 gates. We estimate the number of functions requiring 10..17 gates using the random function size distribution
(see Table 3) and optimal synthesis of all 3-bit reversible functions. We used CS1 to run this test, and it took 10,549
seconds (under 3 hours) to complete using 43.04 GB of RAM. CS2 used 2.74 GB RAM and took 743.401 seconds
(under 13 minutes) to synthesize optimal implementations with up to 8 gates.
long hash64shift (long key)
{
    key = (~key) + (key << 21);             // signed shift
    key = key ^ (key >>> 24);               // unsigned shift
    key = (key + (key << 3)) + (key << 8);
    key = key ^ (key >>> 14);
    key = (key + (key << 2)) + (key << 4);
    key = key ^ (key >>> 28);
    key = key + (key << 31);
    return key;
}



Table 3: Distribution of the number of gates required for 10,000,000 random 4-bit reversible functions.

Size    Functions
14         17,191
13      2,371,039
12      5,110,943
11      2,051,507
10        392,108
9          50,861
8           5,269
7             455
6              24
5               3



4.3 Optimal linear circuits 

Linear reversible circuits are the most complex part of error correcting circuits [1]. The efficiency of these circuits
determines the efficiency of the encoding and decoding operations in quantum error correction. Linear reversible
functions are those whose positive polarity Reed-Muller polynomial has only linear terms. More simply, linear reversible
functions are those computable by circuits with NOT and CNOT gates.

For example, the reversible mapping $a, b, c, d \mapsto b \oplus 1, a \oplus c \oplus 1, d \oplus 1, a$ is a linear reversible function. Interestingly,
this linear function is one of the 138 most complex linear reversible functions: it requires 10 gates in an optimal
implementation. The optimal implementation of this function is given by the circuit CNOT(b,a) CNOT(c,d) CNOT(d,b)
NOT(d) CNOT(a,b) CNOT(d,c) CNOT(b,d) CNOT(d,a) NOT(d) CNOT(c,b).
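The 10-gate circuit above can be checked against the stated mapping by simulating it on all 16 inputs. The sketch below does this under the assumption that wires a, b, c, d are stored in bits 0 through 3 of the state and that gates are applied in the listed left-to-right order; the function names are hypothetical.

```c
#include <assert.h>

/* Wires a, b, c, d are stored in bits 0..3 of the state (assumed encoding). */
enum { A = 0, B = 1, C = 2, D = 3 };

unsigned not_gate(unsigned s, int t)
{
    return s ^ (1u << t);
}

/* CNOT(control, target): target becomes target XOR control. */
unsigned cnot_gate(unsigned s, int c, int t)
{
    return s ^ (((s >> c) & 1u) << t);
}

/* The 10-gate circuit from the text, applied left to right. */
unsigned linear_example_circuit(unsigned s)
{
    assert(s < 16);
    s = cnot_gate(s, B, A);
    s = cnot_gate(s, C, D);
    s = cnot_gate(s, D, B);
    s = not_gate(s, D);
    s = cnot_gate(s, A, B);
    s = cnot_gate(s, D, C);
    s = cnot_gate(s, B, D);
    s = cnot_gate(s, D, A);
    s = not_gate(s, D);
    s = cnot_gate(s, C, B);
    return s;
}

/* The stated specification: a,b,c,d -> b+1, a+c+1, d+1, a (all mod 2). */
unsigned linear_example_spec(unsigned s)
{
    unsigned a = s & 1u, b = (s >> 1) & 1u, c = (s >> 2) & 1u, d = (s >> 3) & 1u;
    return (b ^ 1u) | ((a ^ c ^ 1u) << 1) | ((d ^ 1u) << 2) | (a << 3);
}
```

Comparing `linear_example_circuit(s)` with `linear_example_spec(s)` for every `s` from 0 to 15 confirms that the circuit realizes the mapping; it does not, of course, establish that 10 gates are necessary, which is what the search itself proves.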

We synthesized optimal circuits for all 322,560 4-bit linear reversible functions. This process took under two
seconds on CS2. The distribution of the number of functions requiring a given number of gates is shown in Table 5.

4.4 Synthesis of Benchmarks 

In this subsection, we present optimal circuits for benchmark functions that have been previously reported in the
literature. Table 6 summarizes the results. The table lists the Name of the benchmark function, its complete
Specification, the Size of the Best Known Circuit (SBKC), the Source of this circuit, an indicator of whether this circuit
has been Proved Optimal (PO?), the Size of an Optimal Circuit (SOC), the optimal implementation that our program
found, and the runtime our program takes to find this optimal implementation. We used CS1 for this test, and
report the runtime after the hash table with all optimal implementations with up to 9 gates is loaded into RAM.
Shorter runtimes were measured using multiple runs of the search to achieve sufficient accuracy. Please note that we
introduce the function primesA, which cannot be found in previous literature. Also, the 9-gate circuit for function
mperk requires some extra SWAP gates to properly map inputs into their respective outputs, indicated by an asterisk.

4.5 Searching for a Hard Permutation 

We executed a 12-hour search using CS1 to find a permutation requiring more than 14 gates in an optimal
implementation. To run the search, we used 14- and 13-gate optimal implementations and tried to extend them by assigning
gates to the beginning and the end of those optimal implementations, computing the resulting function, and verifying





Table 4: Number of 4-bit permutations requiring 0..9 gates, and estimates of the number of permutations requiring
10..17 gates. Note that we were not able to estimate the number of functions for sizes 15 and 16.

Size    Functions                        Reduced Functions
16      ???
15      ???
13      $\approx 4.96 \times 10^{12}$
8                                        225,242,556
7       932,651,938                      19,466,575
6       70,763,560                       1,482,686
5       4,807,552                        101,983
4       294,507                          6,538
3       16,204                           425
2       784                              33
1       32                               4
0       1                                1


Table 5: Number of 4-bit linear reversible functions requiring 0..10 gates in an optimal implementation.

Size    Functions
10          138
9         13555
8         84225
7        118424
6         72062
5         26182
4          6589
3          1206
2           162
1            16
0             1




how many gates they require. After the 12-hour search, we were not able to find a permutation requiring more than
14 gates, indicating further that there are not many such permutations. 

5 Conclusions and Future Work 

In this paper, we described an algorithm that finds an optimal circuit for any 4-bit reversible function. Our goal 
was to minimize the number of gates required for function implementation. Our program implementation takes 
approximately 3 hours to calculate all optimal implementations requiring up to 9 gates, and then an average of about 
0.01 seconds to search for an optimal circuit of any 4-bit reversible function. Both calculations are surprisingly fast 
given the size of the search space. 

We demonstrated the synthesis of 117,798,040,190 optimal circuits in 10,549 seconds, amounting to an average 
speed of 11,166,749 circuits per second. This is over 65 times faster, and some 4,500 times more circuits, than the best 
previously reported result (26 million circuits in 152 seconds) [13]. We synthesized optimal implementations for all 
linear reversible functions. 
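
The figures in this paragraph can be re-derived with a few lines of arithmetic. The sketch below recomputes the throughput and the two ratios, and cross-checks the total of Table 5 against the number of invertible affine maps on 4 bits. Two assumptions are made here and are not stated in the text: that "linear reversible" means computable with NOT and CNOT gates (consistent with the 16 one-gate functions in Table 5), and that the unlabeled last row of Table 5 is size 0, the identity.

```python
from math import prod

# Throughput reported in this work and in the earlier result it is compared with.
circuits, seconds = 117_798_040_190, 10_549
prev_circuits, prev_seconds = 26_000_000, 152

rate = circuits / seconds                 # circuits per second
prev_rate = prev_circuits / prev_seconds
speedup = rate / prev_rate                # "over 65 times faster"
volume = circuits / prev_circuits         # "some 4,500 times more"

# Table 5: size -> number of linear reversible functions (size 0 row assumed).
table5 = {10: 138, 9: 13555, 8: 84225, 7: 118424, 6: 72062, 5: 26182,
          4: 6589, 3: 1206, 2: 162, 1: 16, 0: 1}

# |GL(4,2)| = (2^4 - 1)(2^4 - 2)(2^4 - 4)(2^4 - 8); each invertible matrix
# combines with 2^4 choices of NOT pattern to give an affine map.
gl_4_2 = prod(2**4 - 2**i for i in range(4))
affine_maps = 2**4 * gl_4_2

print(int(rate), round(speedup, 2), round(volume, 1))
print(sum(table5.values()), affine_maps)
```

If the two numbers on the last line agree, Table 5 accounts for every 4-bit function of this class exactly once.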

We also demonstrated that the search for an optimal circuit can be done very quickly. For example, if all optimal 
circuits are written to a hypothetical 100+TB 5400 RPM hard drive, the average time to extract a random circuit from 
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Tabic 6: Optimal implementations of benchmark functions. 
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U,lU,lo,y,z,4,14,llJ 


12 


m 


No 


12 


NOT(a) CNOT(c,a) CNOT(a,d) TOF(a,b,d) 
OlNUl(u,aj lvjr(c,a,Dj lUr(a,d,cj l Ur (,D,c,aj 
TOF(a,b,d) NOT(a) CNOT(d,b) CNOT(d,c) 


.000690s 


/IKif 1 ft 
4D1I- i-o 
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the drive would be expected to take on the order of 0.01 — 0.02 seconds (typical access time for 5400 RPM hard drives). 
In other words, it would likely take longer to read the answer from a hypothetical hard drive than to compute it with 
our implementation. Furthermore, the 3-hour calculation of all optimal circuits with up to 9 gates could be shortened 
by storing its result (computed once for the entirety of the described search and its follow-up executions) on the hard 
drive, as was done in Subsection 4.1. It took 1111 seconds, i.e., under 19 minutes, to load optimal circuits with up 
to 9 gates into RAM using CS1. Given that the media transfer rate of modern hard drives is 1 Gbit/s (= 1 GB in 8 
seconds) and higher, it may take no longer than 5 minutes (300 s > 296 s = 37 * 8 s) to load optimal implementations 
into RAM to initiate the search on a different machine. 
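
The timing estimates in this paragraph reduce to a few unit conversions, spelled out below as a quick check. The factor 37 is taken directly from the expression in the text and is assumed here to be the table size in GB; the variable names are illustrative.

```python
load_seconds = 1111                  # measured load time on CS1
load_minutes = load_seconds / 60     # about 18.5 minutes

seconds_per_gb = 8                   # 1 Gbit/s moves 1 GB (8 Gbit) in 8 s
table_gb = 37                        # assumed meaning of the factor 37 in the text
transfer_seconds = table_gb * seconds_per_gb

five_minutes = 5 * 60
print(round(load_minutes, 1), transfer_seconds, transfer_seconds <= five_minutes)
```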

Minor modifications to the algorithm could be explored to address other optimization issues. For example, for 
practical purposes, one may be interested in minimizing depth. This may be important if a faster circuit is preferred, 
and/or if quantum noise depends more strongly on elapsed time than on the disturbance introduced by multiple gate 
applications. It may also be important to account for the different implementation costs of the gates used (generally, 
NOT is much simpler than CNOT, which, in turn, is simpler than Toffoli). Both modifications are possible by making 
changes only to the first part of the search. 

To optimize depth, one needs to consider a different family of gates, where, for instance, the sequence NOT(a) CNOT(b, c) 
is counted as a single gate. To account for different gate costs, one needs to search for small circuits by increasing 
the cost by one at each step (assuming costs are given as natural numbers), as opposed to adding a gate to all maximal size 
optimal circuits. 
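
A cost-driven variant of the first part of the search might look like the sketch below. It builds layers of functions by total cost rather than by gate count: a function enters layer c when it is first reached by appending a gate of cost k to a function in layer c - k. The cost values (NOT = 1, CNOT = 2, Toffoli = 3), the omission of the 3-control Toffoli, and all function names are assumptions made for illustration only.

```python
from itertools import combinations

IDENTITY = tuple(range(16))  # a 4-bit function as a lookup table over 0..15


def make_gate(target, controls):
    """Flip bit `target` on every input whose control bits are all 1."""
    mask = sum(1 << c for c in controls)
    return tuple(x ^ (1 << target) if x & mask == mask else x for x in range(16))


def compose(f, g):
    """Function computed by applying f first and then g."""
    return tuple(g[y] for y in f)


def weighted_library(cost_by_controls):
    """Pairs (gate, cost), where the cost depends only on the number of controls."""
    lib = []
    for t in range(4):
        others = [b for b in range(4) if b != t]
        for k, cost in cost_by_controls.items():
            for ctrl in combinations(others, k):
                lib.append((make_gate(t, ctrl), cost))
    return lib


def cost_layers(lib, max_cost):
    """layers[c] lists the functions whose cheapest circuit costs exactly c."""
    cost_of = {IDENTITY: 0}
    layers = [[IDENTITY]]
    for c in range(1, max_cost + 1):
        layer = []
        for gate, k in lib:
            if k > c:
                continue
            for p in layers[c - k]:      # cheapest circuits of cost c - k
                q = compose(p, gate)
                if q not in cost_of:     # not reachable at any lower cost
                    cost_of[q] = c
                    layer.append(q)
        layers.append(layer)
    return cost_of, layers


# Hypothetical costs: NOT = 1, CNOT = 2, Toffoli = 3.
LIB = weighted_library({0: 1, 1: 2, 2: 3})
COST_OF, LAYERS = cost_layers(LIB, 3)
print([len(layer) for layer in LAYERS])
```

Because every cost is a positive integer, all functions of cost below c are known before layer c is built, so the first time a function appears is at its minimal cost. With all costs set to one, the same loop falls back to counting gates. For depth, the library would instead contain whole layers of gates acting on disjoint lines, each at cost one.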

It is also possible to extend the search to find optimal implementations in restricted architectures (trivially if an 
optimal implementation is required up to the input/output permutation). Finally, the search could be extended to 
find some small optimal 5-bit circuits. However, more work is necessary to determine how far the search 
could be carried with 5-bit optimal implementations. 

Our future plans include: 
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• execution of an extended search to find a hard permutation, one requiring a large (> 15) number of gates; 

• construction of a representative set of functions that could be used to test heuristic synthesis algorithms against; 

• computing all numbers in Table 2 exactly. If completed, this automatically solves the first task by appropriately 
keeping track of relevant computations. Also, this would give an exact number for L(4), the maximal number 
of gates required to implement a 4-bit reversible function, and help with the second task; 

• finding depth-optimal 4-bit circuits and optimal 5-bit circuits. In particular, a simple calculation shows that 
using CS1 it is possible to compute all optimal 5-bit circuits with up to six gates, and thus it is possible to 
search for optimal 5-bit implementations with up to 12 gates; 

• extending techniques reported in this paper to the synthesis of optimal stabilizer circuits. Coupled with a peephole 
optimization algorithm for circuit simplification, these results may become a very useful tool in optimizing error 
correction circuits. This may be of particular practical interest since implementations of quantum algorithms 
may be expected to be dominated by error correction circuits. 
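
One way to read the step from "all circuits with up to six gates" to "implementations with up to 12 gates" in the list above is as a meet-in-the-middle lookup: a table of all optimal circuits with up to k gates can answer queries for functions needing up to 2k gates. The sketch below illustrates this on 4 bits with k = 2. It assumes a 32-gate library (each target line with 0 to 3 controls), and the table layout and function names are hypothetical rather than the data structures used on CS1.

```python
from itertools import combinations

IDENTITY = tuple(range(16))  # a 4-bit function as a lookup table over 0..15


def make_gate(target, controls):
    """Flip bit `target` on every input whose control bits are all 1."""
    mask = sum(1 << c for c in controls)
    return tuple(x ^ (1 << target) if x & mask == mask else x for x in range(16))


def gate_library():
    """Assumed library: each target with 0..3 controls (32 gates)."""
    return [make_gate(t, ctrl)
            for t in range(4)
            for k in range(4)
            for ctrl in combinations([b for b in range(4) if b != t], k)]


def compose(f, g):
    """Function computed by applying f first and then g."""
    return tuple(g[y] for y in f)


def inverse(f):
    inv = [0] * 16
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)


def build_table(gates, k):
    """Breadth-first table of every function needing at most k gates."""
    size_of = {IDENTITY: 0}
    by_size = [[IDENTITY]]
    for size in range(1, k + 1):
        layer = []
        for p in by_size[size - 1]:
            for g in gates:
                q = compose(p, g)
                if q not in size_of:
                    size_of[q] = size
                    layer.append(q)
        by_size.append(layer)
    return size_of, by_size


def optimal_size(f, size_of, by_size):
    """Optimal gate count of f if it is at most 2k, else None."""
    if f in size_of:
        return size_of[f]
    k = len(by_size) - 1
    for i in range(1, k + 1):            # try the shortest prefixes first
        for g in by_size[i]:
            rest = compose(inverse(g), f)  # f = g followed by rest
            if rest in size_of:
                return i + size_of[rest]
    return None


GATES = gate_library()
SIZE_OF, BY_SIZE = build_table(GATES, 2)
f = compose(compose(GATES[0], GATES[5]), compose(GATES[20], GATES[31]))
print(optimal_size(f, SIZE_OF, BY_SIZE))
```

Trying prefixes in order of increasing size means the first hit is optimal: any prefix of an optimal circuit is itself optimal, so a shorter split would have been found earlier. The memory cost is that of the k-gate table, while each query scans at most the whole table once.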
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